Resilience-Aware Resource Management for Exascale Computing Systems



Related resources

Using replication for resilience on exascale systems

High-performance computing applications must be tolerant to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-rollback, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the opt...


Toward Exascale Resilience

Over the past few years, resilience has become a major issue for HPC systems, particularly for large petascale systems and future exascale ones. These systems will typically gather from half a million to several million CPU cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...


Using group replication for resilience on exascale systems

High-performance computing applications must be resilient to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the o...


Combining Process Replication and Checkpointing for Resilience on Exascale Systems

Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has been recently adv...


A Genetic Based Resource Management Algorithm Considering Energy Efficiency in Cloud Computing Systems

Cloud computing is a result of the continuing progress made in the areas of hardware, Internet-related technologies, distributed computing, and automated management. Increasing demand has led to an increase in services, resulting in the establishment of large-scale computing and data centers, along with high operating costs and huge amounts of electrical power consumption. Insuffic...



Journal

Journal title: IEEE Transactions on Sustainable Computing

Year: 2018

ISSN: 2377-3782,2377-3790

DOI: 10.1109/tsusc.2018.2797890